Olympic Data
1 Introduction
Team 010100 consists of the following members: Izzy Illari, Lucia Illari, Omar Qusous, and Lydia Teinfalt. You may find our work on GitHub.
For the second portion of our group project, we kept the Olympics data from the EDA. Our SMART questions were: What factors can be used to model the probability of being awarded a medal? What groups/clusters do athletes of different sports fall into? How does a pandemic affect the medals awarded? How can the evolution of athlete characteristics over time be modelled? With these questions in mind, we set out to see whether we could use the data on Olympians to find patterns and build models that answer them.
We used the Kaggle dataset 120 years of Olympic history: athletes and results, available at https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results. This historical dataset covers all Olympic Games from Athens 1896 to Rio 2016 and was scraped from https://www.sports-reference.com/. We focused on data from the 1960-2016 Olympic events for the clustering (Kmeans and Kmedoids), the Linear and Logit Regression, and the trends over time. For the pandemic analysis, we focused on data from Olympic Games held before and after the H1N1 ("Spanish flu") pandemic of 1918-1919.
The report is organized as follows:
- Summary of Dataset
- Data Prep
- EDA
- Clustering, Kmeans, Kmedoids
- Linear and Logit Regression
- Random Forest
- Pandemic (Spanish Flu)
- Trends over time
- Summary and Conclusion
- References
2 Summary of Dataset
The data looks like the following:
'data.frame': 151977 obs. of 24 variables:
$ NOC : Factor w/ 122 levels "AFG","ALB","AND",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Year : int 1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
$ Decade : Factor w/ 6 levels "1960s","1970s",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ID : Factor w/ 74771 levels "1","2","6","7",..: 32644 32479 60453 32515 70738 16344 21919 59125 70738 32153 ...
$ First.Name : Factor w/ 14118 levels "","A","A.","Aadam",..: 8716 3731 64 599 64 11978 64 4634 64 8716 ...
$ Name : Factor w/ 74268 levels " Gabrielle Marie \"Gabby\" Adcock (White-)",..: 48941 19066 219 3341 221 64833 216 23793 221 48946 ...
$ Last.Name : Factor w/ 47370 levels "","-)","-Alard)",..: 23228 23112 38893 23137 44908 13260 16633 37860 44908 22890 ...
$ Sex : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
$ Age : int 24 18 20 35 20 28 22 23 20 20 ...
$ Height : int 171 162 178 166 179 168 172 170 179 166 ...
$ Weight : num 78 52 68 66 75 73 70 58 75 62 ...
$ BMI : num 26.7 19.8 21.5 24 23.4 ...
$ BMI.Category: Factor w/ 5 levels "0","1","2","3",..: 4 1 3 3 3 4 3 3 3 3 ...
$ Team : Factor w/ 332 levels "Acipactli","Afghanistan",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Population : int 8996973 8996973 8996973 8996973 8996973 8996973 8996973 8996973 8996973 8996973 ...
$ GDP : num 5.38e+08 5.38e+08 5.38e+08 5.38e+08 5.38e+08 ...
$ GDPpC : num 59.8 59.8 59.8 59.8 59.8 ...
$ Games : Factor w/ 30 levels "1960 Summer",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Season : Factor w/ 2 levels "Summer","Winter": 1 1 1 1 1 1 1 1 1 1 ...
$ City : Factor w/ 29 levels "Albertville",..: 19 19 19 19 19 19 19 19 19 19 ...
$ Sport : Factor w/ 51 levels "Alpine Skiing",..: 51 51 3 51 3 51 3 3 3 51 ...
$ Event : Factor w/ 489 levels "Alpine Skiing Men's Combined",..: 478 468 17 476 33 482 22 24 18 466 ...
$ Medal : Factor w/ 4 levels "Bronze","Gold",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Medal.No.Yes: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
The athlete events data has 24 columns and 151977 rows/entries, for a total of 3647448 individual data points. In olympic_data each row corresponds to an individual athlete competing in an individual Olympic event. The variables are the following:
- ID: Unique number for each athlete
- Name: Athlete’s name
- Sex: M or F
- Age: Integer
- Height: centimeters
- Weight: kilograms
- Team: Team name
- NOC: National Olympic Committee 3-letter code
- Games: Year and season
- Year: Integer
- Season: Summer or Winter
- City: Host city
- Sport
- Event
- Medal: Gold, Silver, Bronze, or NA
To prepare our data for EDA, we dropped the Olympic event Art Sculpting and removed rows containing NAs. We also modified the data relative to the original Kaggle dataset: it now starts at 1960 and includes the following new variables:
- Decade (factor)
- First name (factor)
- Last name (factor)
- BMI (numeric)
- BMI category (factor)
- Population (numeric)
- GDP (numeric)
- GDPpC (numeric)
- Medal: Yes or No (factor)
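The derived BMI column follows directly from the original Height (cm) and Weight (kg) columns. As a sanity check, here is a minimal sketch of that calculation (shown in Python for illustration; our analysis itself was done in R):

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight in kg divided by height in metres squared."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2

# First two athletes from the data summary above (Height 171/162, Weight 78/52)
print(round(bmi(78, 171), 1))  # 26.7, matching the first BMI value shown
print(round(bmi(52, 162), 1))  # 19.8
```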
3 EDA
For EDA, we start with a quick summary of the data.
| | NOC | Year | Decade | ID | First.Name | Name | Last.Name | Sex | Age | Height | Weight | BMI | BMI.Category | Team | Population | GDP | GDPpC | Games | Season | City | Sport | Event | Medal | Medal.No.Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min | USA :12218 | 1960 | 1960s:14506 | 94406 : 30 | John : 1102 | Michael Fred Phelps, II: 30 | Jr. : 488 | F:46954 | 12.00 | 127.0 | 28.00 | 10.50 | 0:16785 | United States:11748 | 9.891e+03 | 3.029e+07 | 59.5 | 2000 Summer:10621 | Summer:105729 | Sydney :10621 | Athletics :18072 | Ice Hockey Men’s Ice Hockey : 2399 | Bronze : 6116 | 0:115990 |
| Q1 | CAN : 7527 | 1984 | 1970s:11261 | 11951 : 27 | David : 995 | Ole Einar Bjrndalen : 27 | Smith : 269 | M:87011 | 21.00 | 168.0 | 60.00 | 20.90 | 1: 3311 | Canada : 7250 | 1.028e+07 | 6.500e+10 | 2759.0 | 2008 Summer:10568 | Winter: 28236 | Beijing :10568 | Swimming :12767 | Hockey Men’s Hockey : 2024 | Gold : 5941 | 1: 17975 |
| Median | ITA : 7419 | 1998 | 1980s:19330 | 91845 : 25 | Michael: 923 | Yang Wei : 26 | Garca : 256 | NA | 24.00 | 175.0 | 70.00 | 22.50 | 2:92817 | France : 7201 | 4.208e+07 | 3.300e+11 | 10586.0 | 2016 Summer:10448 | NA | Rio de Janeiro:10448 | Gymnastics :10437 | Football Men’s Football : 1996 | No Medal:115990 | NA |
| Mean | FRA : 7358 | 1995 | 1990s:23126 | 12678 : 24 | Kim : 882 | Gabriella Paruzzi : 25 | Silva : 249 | NA | 25.14 | 175.3 | 70.47 | 22.73 | 3:18254 | Italy : 7174 | 1.132e+08 | 1.490e+12 | 16809.3 | 2004 Summer:10120 | NA | Athina :10120 | Cross Country Skiing: 5434 | Basketball Men’s Basketball : 1469 | Silver : 5918 | NA |
| Q3 | JPN : 7219 | 2008 | 2000s:38099 | 14170 : 24 | Robert : 808 | Lee Ju-Hyeong : 25 | Rodrguez: 219 | NA | 28.00 | 183.0 | 79.00 | 24.20 | 4: 2798 | Japan : 7071 | 9.315e+07 | 1.340e+12 | 26401.8 | 2012 Summer: 9904 | NA | London : 9904 | Cycling : 5044 | Cycling Men’s Road Race, Individual: 1306 | NA | NA |
| Max | GBR : 7169 | 2016 | 2010s:27643 | 15991 : 24 | Jos : 760 | Alberto Busnari : 24 | Gonzlez : 217 | NA | 71.00 | 226.0 | 214.00 | 63.90 | NA | Great Britain: 6915 | 1.380e+09 | 1.870e+13 | 178846.0 | 1996 Summer: 9102 | NA | Atlanta : 9102 | Rowing : 4712 | Water Polo Men’s Water Polo : 1286 | NA | NA |
| NA | (Other):85055 | NA | NA | (Other):133811 | (Other):128495 | (Other) :133808 | (Other) :132267 | NA | NA | NA | NA | NA | NA | (Other) :86606 | NA | NA | NA | (Other) :73202 | NA | (Other) :73202 | (Other) :77499 | (Other) :123485 | NA | NA |
4 Olympics Correlation plot
Quickly visualizing the correlations will be useful for model building, but we have to be mindful that columns such as Medal and Medal.No.Yes are naturally going to be highly correlated.
It might be more useful to focus on the correlations for only the variables Medal.No.Yes and Medal:
| rowname | Medal.No.Yes |
|---|---|
| GDP | 0.1414144 |
| GDPpC | 0.0884261 |
| Height | 0.0831910 |
| Weight | 0.0771523 |
| Population | 0.0756385 |
| BMI.Category | 0.0722161 |
| NOC | 0.0663156 |
| Team | 0.0620423 |
| Event | 0.0551700 |
| Sport | 0.0550994 |
| Year | 0.0520386 |
| Decade | 0.0517238 |
| Games | 0.0513962 |
| BMI | 0.0418939 |
| Age | 0.0314189 |
| ID | 0.0105085 |
| First.Name | 0.0044511 |
| Name | 0.0042080 |
| Last.Name | -0.0067437 |
| City | -0.0180115 |
| Sex | -0.0305938 |
| Season | -0.0420744 |
| Medal | -0.4586621 |
| rowname | Medal |
|---|---|
| Season | 0.0237157 |
| City | 0.0108140 |
| Sex | 0.0083743 |
| Last.Name | 0.0032221 |
| Name | -0.0017224 |
| First.Name | -0.0017838 |
| ID | -0.0057542 |
| Age | -0.0150506 |
| Games | -0.0204830 |
| Decade | -0.0208330 |
| Year | -0.0209759 |
| Team | -0.0211824 |
| BMI | -0.0230238 |
| Event | -0.0268180 |
| Sport | -0.0271357 |
| Population | -0.0276393 |
| NOC | -0.0294776 |
| BMI.Category | -0.0364803 |
| GDPpC | -0.0372752 |
| Weight | -0.0380738 |
| Height | -0.0381812 |
| GDP | -0.0600564 |
| Medal.No.Yes | -0.4586621 |
Based on the strength of these correlations, building a general model from the variable Medal.No.Yes is likely the better choice, unless we ignore the athletes that did not receive a medal.
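Since the correlation matrix treats factors as their integer codes, the binary Medal.No.Yes column enters as a 0/1 variable, so its correlations are point-biserial, i.e. ordinary Pearson correlations with one binary variable. A small sketch of the computation (toy numbers, not the Olympic data; Python for illustration):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy example: a 0/1 medal indicator against a numeric predictor
medal = [0, 0, 0, 1, 1]
gdp = [1.0, 2.0, 3.0, 4.0, 5.0]
print(round(pearson_r(medal, gdp), 3))  # 0.866
```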
5 Clustering, Kmeans, Kmedoids
My first thought was to do some clustering with just the numeric columns originally present in the data, namely Age, Weight, and Height, so I decided to look at some 3D scatter plots.
So we are indeed seeing different behavior in these two sports: Triathlon appears more spread out, while Softball appears mostly clustered around lower ages. First things first, before we go ahead and look for clusters, we should calculate the Hopkins statistic. We conduct the Hopkins statistic test iteratively, using 0.5 as the threshold: if H ≤ 0.5, it is unlikely that the dataset D has statistically significant clusters. Put another way, if the value of the Hopkins statistic is close to 1, we can reject the null hypothesis and conclude that the dataset D is significantly clusterable. We need to make sure to remove NAs and scale the variables to make them comparable; scaling consists of transforming the variables so that they have mean zero and standard deviation one.
[1] "Triathlon without Population and GDP"
Hopkins Statistic H = 0.7551693
[1] "Triathlon with Population and GDP"
Hopkins Statistic H = 0.8262848
[1] "Softball without Population and GDP"
Hopkins Statistic H = 0.7773288
[1] "Softball with Population and GDP"
Hopkins Statistic H = 0.7967403
Clearly all of these values are greater than 0.5, so statistically significant clusters are present. Of course, Kmeans (and Kmedoids) requires us to specify the number of clusters \(k\). We can use the elbow method, the silhouette method, and the gap statistic to get an idea of how many clusters we should specify.
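For reference, the Hopkins statistic compares nearest-neighbour distances from m uniformly random probe points to the data against nearest-neighbour distances from m sampled data points to the rest of the data: H = Σu/(Σu + Σw). The sketch below is a rough Python illustration on synthetic 2-D data (not the R implementation we used):

```python
import random
from math import dist

def hopkins(data, m, seed=0):
    """Hopkins statistic: near 0.5 => random data, near 1 => clusterable."""
    rng = random.Random(seed)
    xs, ys = [p[0] for p in data], [p[1] for p in data]
    # m uniformly random probe points inside the bounding box of the data
    uniform = [(rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
               for _ in range(m)]
    sample = rng.sample(data, m)
    u = sum(min(dist(p, q) for q in data) for p in uniform)
    w = sum(min(dist(p, q) for q in data if q != p) for p in sample)
    return u / (u + w)

# Two tight, well-separated blobs => H should come out well above 0.5
rng = random.Random(1)
blob = lambda cx, cy: [(cx + rng.gauss(0, 0.1), cy + rng.gauss(0, 0.1))
                       for _ in range(50)]
data = blob(0, 0) + blob(10, 10)
print(hopkins(data, m=20))
```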
For the Triathlon data with only variables Age, Height, and Weight, for Kmeans and Kmedoids, we’ll try \(k\)=2, and for the Triathlon data that additionally has Population and GDP, we’ll try \(k\)=3 and \(k\)=7.
This is interesting for Softball. Though according to the Hopkins statistic there were statistically significant clusters, \(k\)=1 keeps being suggested. There are likely to be many overlapping data points in the clusters when \(k \geq 2\). I can try \(k\)=2 for Softball with only Age, Weight, and Height, and, with Population and GDP added, \(k\)=3 for Kmedoids and \(k\)=6 for Kmeans.
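Kmeans itself is simple enough to sketch: Lloyd's algorithm alternates between assigning each point to its nearest centre and moving each centre to the mean of its assigned points until nothing changes. A minimal Python illustration (not the R code used for our clustering):

```python
import random
from math import dist

def kmeans(points, k, seed=0, iters=100):
    """Lloyd's algorithm on a list of 2-D tuples; returns the final centres."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: index of the nearest centre for each point
        labels = [min(range(k), key=lambda j: dist(p, centres[j]))
                  for p in points]
        # update step: each centre becomes the mean of its assigned points
        new = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centre if a cluster empties out
                new.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new.append(centres[j])
        if new == centres:
            break
        centres = new
    return centres

# Two obvious groups; the centres should land near (0, 0) and (10, 10)
pts = [(0.1, 0.2), (-0.2, 0.1), (0.0, -0.1),
       (9.9, 10.1), (10.2, 9.8), (10.0, 10.0)]
print(sorted(kmeans(pts, 2)))
```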
| Dim.1 | Dim.2 | Dim.3 | |
|---|---|---|---|
| Age | 0.2739968 | 99.3952316 | 0.3307716 |
| Height | 50.0136554 | 0.0013354 | 49.9850093 |
| Weight | 49.7123478 | 0.6034330 | 49.6842192 |
# A tibble: 2 x 4
Cluster Age Height Weight
<int> <dbl> <dbl> <dbl>
1 1 28.2 166. 54.2
2 2 27.5 180. 68.4
# A tibble: 2 x 4
Cluster Age Height Weight
<int> <dbl> <dbl> <dbl>
1 1 27.4 180. 68.1
2 2 28.4 166. 53.8
I will only print out the means for the clustering here with the most distinct clusters, which is \(k\)=3 using Kmeans.
| Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 | |
|---|---|---|---|---|---|
| Age | 0.2738287 | 0.3359204 | 89.3319869 | 9.5647119 | 0.4935521 |
| Height | 49.5850049 | 0.5009423 | 0.0013683 | 0.1849146 | 49.7277699 |
| Weight | 49.4872272 | 0.2153672 | 0.3818564 | 0.6547254 | 49.2608238 |
| Population | 0.2345734 | 50.8194590 | 3.1923196 | 45.3527555 | 0.4008925 |
| GDP | 0.4193659 | 48.1283110 | 7.0924688 | 44.2428925 | 0.1169618 |
# A tibble: 3 x 6
Cluster Age Height Weight Population GDP
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 28.3 175. 63.7 590769231. 1.19e13
2 2 27.4 180. 68.2 52221408. 1.11e12
3 3 28.1 166. 53.7 59387371. 1.56e12
We use Population and GDP here as a proxy for Team, since Team is not a continuous variable. Yes, we could make a variable such as Team numeric and use it that way, but what would a cluster with a mean Team = 2.5 mean? How is an athlete "in between" two teams? This way we can get a sense of the Teams while using continuous variables. For example, for the cluster with the highest mean GDP (cluster 1 above), the only countries that meet this GDP requirement are China and the US, so there is a distinct cluster in the Triathlon data made up of American and Chinese athletes. In fact, if we look at all the GDP values, they are all relatively high and correspond to countries like Australia, Canada, etc., which tells us that these clusters are all made up of athletes from rather rich countries.
Moving on to softball:
| Dim.1 | Dim.2 | Dim.3 | |
|---|---|---|---|
| Age | 1.59669 | 98.0171460 | 0.3861638 |
| Height | 49.58696 | 0.2061233 | 50.2069183 |
| Weight | 48.81635 | 1.7767307 | 49.4069180 |
# A tibble: 2 x 4
Cluster Age Height Weight
<int> <dbl> <dbl> <dbl>
1 1 25.8 167. 64.3
2 2 28.4 177. 77.3
# A tibble: 2 x 4
Cluster Age Height Weight
<int> <dbl> <dbl> <dbl>
1 1 28.4 175. 74.6
2 2 25.1 166. 63.2
Using \(k\)=2 wasn't actually too bad! Kmeans and Kmedoids appear to have recovered very similar centers. Moving on to the Softball data with Population and GDP: technically \(k\)=3 was suggested for Kmedoids and \(k\)=6 for Kmeans, but I will try both cluster sizes for both methods.
| Dim.1 | Dim.2 | Dim.3 | Dim.4 | Dim.5 | |
|---|---|---|---|---|---|
| Age | 0.4196026 | 44.1563110 | 35.479383 | 19.616889 | 0.327814 |
| Height | 41.5173922 | 0.9554527 | 1.662170 | 10.720112 | 45.144873 |
| Weight | 43.2316997 | 1.9327517 | 2.825011 | 2.867047 | 49.143491 |
| Population | 0.8178199 | 38.7382345 | 55.625875 | 1.149520 | 3.668550 |
| GDP | 14.0134856 | 14.2172501 | 4.407561 | 65.646431 | 1.715272 |
As we can see, the clusters are all on top of each other. \(k\)=3 produces roughly the same clusters whether we use Kmeans or Kmedoids, but \(k\)=6 gives very different results depending on the method. I will not print out the means/medoids here, since the clusters overlap and are not as distinct. We could investigate what these clusters look like when plotted along the different principal axes:
but the clusters are still on top of each other. It appears that for some sports adding Population and GDP can help define new clusters, but for others it muddies them. It is possible that other clustering methods may reveal clusters not found using Kmeans and Kmedoids.
6 Linear and Logit Regression
6.1 Linear Model
We added a new column, TMedals, to the Olympic dataset to hold the total number of medals earned per event year. The purpose of this new numeric column is to enable building linear models.
Based on the correlation plot, GDP, Population, Height, and Sport.Int are the features with the highest correlation with TMedals. Four models will be built, adding one feature at a time to find the best-fitting model. We can ignore Year and Decade showing high correlation with the total number of medals, because that field was created based on the Olympic year.
Call:
lm(formula = dyn(TMedals ~ GDP + Population + Height + Sport.Int),
data = new_data)
Residuals:
Min 1Q Median 3Q Max
-2242.2 -1094.1 219.7 918.3 1512.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.128e+03 1.205e+02 9.364 < 2e-16 ***
GDP 4.436e-11 1.855e-12 23.915 < 2e-16 ***
Population 2.313e-07 2.808e-08 8.237 < 2e-16 ***
Height 5.518e+00 6.742e-01 8.184 2.92e-16 ***
Sport.Int 3.495e+00 4.973e-01 7.028 2.17e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 995.5 on 17970 degrees of freedom
Multiple R-squared: 0.05487, Adjusted R-squared: 0.05466
F-statistic: 260.8 on 4 and 17970 DF, p-value: < 2.2e-16
| Adjusted R2 | |
|---|---|
| Linear Model 1: TMedals ~ GDP | 0.0454538 |
| Linear Model 2: TMedals ~ GDP + Population | 0.0484146 |
| Linear Model 3: TMedals ~ GDP + Population + Height | 0.0521113 |
| Linear Model 4: TMedals ~ GDP + Population + Height + Sport.Int | 0.0546568 |
| VIF | |
|---|---|
| GDP | 1.1481 |
| Population | 1.1740 |
| Height | 1.0270 |
| Sport.Int | 1.0099 |
According to the Adjusted \(R^2\) value, model 4 is the best fit. The coefficients' p-values for (Intercept), GDP, Population, Height, and Sport.Int are less than the significance level \(\alpha\) = 0.05, making them statistically significant. The VIF values for GDP, Population, Height, and Sport.Int are greater than 1 but less than 5, so multicollinearity is not an issue with this model.
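The adjusted \(R^2\) values in the table can be recomputed from the model summary via \(\bar{R}^2 = 1 - (1 - R^2)\,(n-1)/(n-p-1)\). A quick check against model 4 above (multiple \(R^2\) = 0.05487, 17970 residual degrees of freedom, \(p\) = 4 predictors, hence \(n\) = 17975; Python for illustration):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fit to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Model 4: TMedals ~ GDP + Population + Height + Sport.Int
print(round(adjusted_r2(0.05487, n=17975, p=4), 5))  # 0.05466
```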
Analysis of Variance Table
Model 1: TMedals ~ GDP
Model 2: TMedals ~ GDP + Population
Model 3: TMedals ~ GDP + Population + Height
Model 4: TMedals ~ GDP + Population + Height + Sport.Int
Res.Df RSS Df Sum of Sq F Pr(>F)
1 17973 1.7985e+10
2 17972 1.7928e+10 1 56784414 57.299 3.925e-14 ***
3 17971 1.7858e+10 1 70640998 71.281 < 2.2e-16 ***
4 17970 1.7809e+10 1 48946153 49.389 2.174e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Res.Df RSS Df Sum of Sq
Min. :17970 Min. :1.781e+10 Min. :1 Min. :48946153
1st Qu.:17971 1st Qu.:1.785e+10 1st Qu.:1 1st Qu.:52865283
Median :17972 Median :1.789e+10 Median :1 Median :56784414
Mean :17972 Mean :1.789e+10 Mean :1 Mean :58790521
3rd Qu.:17972 3rd Qu.:1.794e+10 3rd Qu.:1 3rd Qu.:63712706
Max. :17973 Max. :1.799e+10 Max. :1 Max. :70640998
NA's :1 NA's :1
F Pr(>F)
Min. :49.39 Min. :0
1st Qu.:53.34 1st Qu.:0
Median :57.30 Median :0
Mean :59.32 Mean :0
3rd Qu.:64.29 3rd Qu.:0
Max. :71.28 Max. :0
NA's :1 NA's :1
An ANOVA test was used to compare the four models. The p-values are less than the standard \(\alpha\) = 0.05, so the models differ significantly. There is enough variance explained that we can reject the null hypothesis that the models are the same.
6.2 Logit Regression: Factors influencing Earning Olympic Medals
We used logit regression to model what factors influence the chances of an athlete receiving a medal, Medal.No.Yes.
Call:
glm(formula = Medal.No.Yes ~ Sex.Int + Height + Sport.Int, family = binomial(link = "logit"),
data = ol_dt_subset)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0290 -0.5688 -0.5000 -0.4327 2.6067
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.3041122 0.1427563 -51.16 <2e-16 ***
Sex.Int -0.5789713 0.0196270 -29.50 <2e-16 ***
Height 0.0349904 0.0008865 39.47 <2e-16 ***
Sport.Int 0.0094889 0.0005236 18.12 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 105631 on 133964 degrees of freedom
Residual deviance: 103524 on 133961 degrees of freedom
AIC: 103532
Number of Fisher Scoring iterations: 4
(Intercept) Sex.Int Height Sport.Int
0.0006727665 0.5604746546 1.0356098076 1.0095340681
2.5 % 97.5 %
(Intercept) -7.583909308 -7.02431508
Sex.Int -0.617439435 -0.54050308
Height 0.033252845 0.03672803
Sport.Int 0.008462602 0.01051521
fitting null model for pseudo-r2
The McFadden value of 0.02 (one of the pseudo-R\(^2\) statistics) also shows that this is not a particularly good model, with only about 2% of the variation explained.
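The unlabeled row beneath the coefficient table is exp(coef), i.e. the odds ratios, and because the deviance of a binary logit is \(-2\) times the log-likelihood, McFadden's pseudo-R\(^2\) equals one minus the ratio of residual to null deviance. A quick check against the output above (Python for illustration):

```python
from math import exp

# Odds ratios from the logit coefficients printed above
print(round(exp(-0.5789713), 4))  # Sex.Int -> 0.5605
print(round(exp(0.0349904), 4))   # Height  -> 1.0356

def mcfadden(residual_deviance, null_deviance):
    """McFadden pseudo-R^2 = 1 - ll_model/ll_null = 1 - D_res/D_null."""
    return 1 - residual_deviance / null_deviance

print(round(mcfadden(103524, 105631), 2))  # 0.02
```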
6.3 Logistic Regression: Medal.No.Yes variable
Now we have already looked at the correlation matrix for the entire dataset, and it looks like Medal.No.Yes might have some stronger correlations than Medal. We’ll be attempting to use the entire dataset for logistic regression to see what variables go into predicting whether an athlete is awarded a medal or not.
Call:
glm(formula = Medal.No.Yes ~ GDP + Height + Weight + Population +
Sport, family = binomial(link = "logit"), data = olympic.data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4937 -0.5543 -0.4351 -0.3488 2.6108
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.560e+00 2.189e-01 -20.830 < 2e-16 ***
GDP 8.836e-14 2.282e-15 38.719 < 2e-16 ***
Height 8.280e-03 1.527e-03 5.421 5.94e-08 ***
Weight 1.373e-03 1.068e-03 1.285 0.198737
Population 3.169e-10 3.116e-11 10.169 < 2e-16 ***
SportArchery 8.303e-01 1.052e-01 7.894 2.93e-15 ***
SportAthletics 5.222e-01 7.130e-02 7.324 2.40e-13 ***
SportBadminton 6.829e-01 1.248e-01 5.471 4.46e-08 ***
SportBaseball 2.654e+00 1.052e-01 25.215 < 2e-16 ***
SportBasketball 1.608e+00 8.274e-02 19.431 < 2e-16 ***
SportBeach Volleyball 7.545e-01 1.622e-01 4.652 3.28e-06 ***
SportBiathlon 2.011e-01 9.511e-02 2.114 0.034499 *
SportBobsleigh 1.173e-01 1.242e-01 0.945 0.344823
SportBoxing 1.308e+00 8.421e-02 15.529 < 2e-16 ***
SportCanoeing 1.012e+00 8.146e-02 12.418 < 2e-16 ***
[ reached getOption("max.print") -- omitted 40 rows ]
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 105748 on 134304 degrees of freedom
Residual deviance: 98409 on 134250 degrees of freedom
(17672 observations deleted due to missingness)
AIC: 98519
Number of Fisher Scoring iterations: 5
Predicted 0 Predicted 1 Total
Actual 0 115863 462 116325
Actual 1 17429 551 17980
Total 133292 1013 134305
fitting null model for pseudo-r2
McFadden
0.06939646
Area under the curve: 0.6884
Team was not included because it was not significant; the overall model, however, is significant. For the area under the curve, it is FALSE that it is more than 0.8, so this is not a good model. The true negative percentage was 99.6028369% and the true positive percentage was 3.0645161%, so it appears the model is mostly labelling everything as not receiving a medal.
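The true negative and true positive percentages quoted throughout this section come directly from the confusion matrix: the TN rate is TN over all actual negatives and the TP rate is TP over all actual positives. Checking against the matrix above (Python for illustration):

```python
def rates(tn, fp, fn, tp):
    """True-negative and true-positive rates from confusion-matrix counts."""
    return tn / (tn + fp), tp / (fn + tp)

# Confusion matrix for the all-sports logit model above
tnr, tpr = rates(tn=115863, fp=462, fn=17429, tp=551)
print(f"TN% = {tnr:.7%}, TP% = {tpr:.7%}")
```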
Let's try a model for Basketball, one of the more popular sports by athlete count. I'm going to include Age, Weight, and Height at first, but something interesting happens:
Call:
glm(formula = Medal.No.Yes ~ Age + Weight + Height + GDP + Population +
Team, family = binomial(link = "logit"), data = basket1)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.40839 -0.61712 -0.00009 0.00001 2.63631
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.709e+00 1.865e+00 1.989 0.046732 *
Age -1.421e-02 1.668e-02 -0.852 0.394215
Weight -1.961e-02 1.006e-02 -1.950 0.051226 .
Height -4.557e-03 1.280e-02 -0.356 0.721886
GDP 5.767e-13 1.057e-13 5.458 4.83e-08 ***
Population -2.642e-08 3.426e-09 -7.710 1.26e-14 ***
TeamAustralia -1.747e+00 3.325e-01 -5.254 1.49e-07 ***
TeamBelarus -2.039e+01 2.285e+03 -0.009 0.992883
TeamBrazil 1.347e+00 4.547e-01 2.964 0.003041 **
TeamCanada -2.016e+01 9.298e+02 -0.022 0.982702
TeamCentral African Republic -2.033e+01 3.094e+03 -0.007 0.994757
TeamChina 2.678e+01 3.628e+00 7.383 1.55e-13 ***
TeamCongo (Kinshasa) -1.955e+01 3.238e+03 -0.006 0.995183
TeamCuba -2.445e+00 4.357e-01 -5.612 2.00e-08 ***
TeamCzech Republic -2.046e+01 1.787e+03 -0.011 0.990866
[ reached getOption("max.print") -- omitted 27 rows ]
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2840.1 on 2451 degrees of freedom
Residual deviance: 1416.6 on 2410 degrees of freedom
(216 observations deleted due to missingness)
AIC: 1500.6
Number of Fisher Scoring iterations: 18
fitting null model for pseudo-r2
McFadden
0.5012314
Predicted 0 Predicted 1 Total
Actual 0 1773 27 1800
Actual 1 304 348 652
Total 2077 375 2452
Area under the curve: 0.9089
Looking at this, we see that the variables Age, Height, and Weight are not significant, but GDP, Population, and Team are! Let’s go ahead and see what the model looks like without these variables…
Call:
glm(formula = Medal.No.Yes ~ GDP + Population + Team, family = binomial(link = "logit"),
data = basket1)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.32386 -0.64567 -0.00008 0.00000 2.40026
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.522e-01 2.831e-01 1.597 0.11016
GDP 5.880e-13 1.038e-13 5.667 1.45e-08 ***
Population -2.716e-08 3.344e-09 -8.120 4.65e-16 ***
TeamAustralia -1.452e+00 3.164e-01 -4.590 4.42e-06 ***
TeamBelarus -1.979e+01 2.293e+03 -0.009 0.99311
TeamBrazil 1.802e+00 4.387e-01 4.109 3.98e-05 ***
TeamCanada -1.966e+01 9.368e+02 -0.021 0.98325
TeamCentral African Republic -1.995e+01 3.104e+03 -0.006 0.99487
TeamChina 2.798e+01 3.542e+00 7.900 2.79e-15 ***
TeamCongo (Kinshasa) -1.886e+01 3.104e+03 -0.006 0.99515
TeamCuba -1.980e+00 4.146e-01 -4.776 1.79e-06 ***
TeamCzech Republic -1.985e+01 1.792e+03 -0.011 0.99116
TeamEgypt -1.883e+01 1.618e+03 -0.012 0.99072
TeamFinland -1.990e+01 3.104e+03 -0.006 0.99489
TeamFrance -9.292e-01 3.708e-01 -2.506 0.01221 *
[ reached getOption("max.print") -- omitted 24 rows ]
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2899.5 on 2535 degrees of freedom
Residual deviance: 1468.0 on 2497 degrees of freedom
(132 observations deleted due to missingness)
AIC: 1546
Number of Fisher Scoring iterations: 18
fitting null model for pseudo-r2
McFadden
0.4937136
Predicted 0 Predicted 1 Total
Actual 0 1857 23 1880
Actual 1 297 359 656
Total 2154 382 2536
Area under the curve: 0.9065
The area under the curve is hardly affected - it's a 0.0023698 difference! Moreover, the model with just GDP, Population, and Team is overall significant, and it is TRUE that the area under the curve is more than 0.8. In fact, the area under the curve is only 0.0934954 away from 1, so this would appear to be a very good model. The true negative percentage was 98.7765957% and the true positive percentage was 54.7256098%, so this model is much better at predicting when an athlete will receive a medal. Of course, we would like the TP value to be higher, but it is interesting that Basketball is a sport that can be modelled by variables related to the country the athletes come from, rather than by characteristics of the athletes themselves…
Softball (and Baseball) were removed from the Olympics because the USA and Japan dominated the sport. Let's see what variables affect the logistic regression model for Softball.
Call:
glm(formula = Medal.No.Yes ~ Age + Weight + Height + GDP + Population +
Team, family = binomial(link = "logit"), data = soft)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.83395 -0.00004 -0.00001 0.00003 1.32990
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.628e+01 3.507e+03 0.005 0.99630
Age 1.208e-01 8.699e-02 1.389 0.16482
Weight 1.690e-02 4.542e-02 0.372 0.70988
Height 2.172e-02 6.557e-02 0.331 0.74043
GDP 5.809e-12 2.946e-12 1.972 0.04864 *
Population -3.061e-07 1.151e-07 -2.658 0.00785 **
TeamCanada -4.264e+01 4.561e+03 -0.009 0.99254
TeamChina 3.485e+02 3.510e+03 0.099 0.92091
TeamCuba -4.304e+01 8.169e+03 -0.005 0.99580
TeamItaly -3.772e+01 5.796e+03 -0.007 0.99481
TeamJapan -1.238e+01 3.507e+03 -0.004 0.99718
TeamNew Zealand -4.548e+01 8.087e+03 -0.006 0.99551
TeamUnited States 2.988e+01 4.195e+03 0.007 0.99432
TeamVenezuela -3.924e+01 8.033e+03 -0.005 0.99610
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 507.328 on 366 degrees of freedom
Residual deviance: 66.672 on 353 degrees of freedom
(7 observations deleted due to missingness)
AIC: 94.672
Number of Fisher Scoring iterations: 20
fitting null model for pseudo-r2
McFadden
0.8685814
Predicted 0 Predicted 1 Total
Actual 0 180 15 195
Actual 1 3 169 172
Total 183 184 367
Area under the curve: 0.993
Look at these results! Age, Height, and Weight have no significance, while GDP is significant at the * level and Population at the ** level. The area under the curve is 0.9929636, the true negative percentage was 92.3076923%, and the true positive percentage was 98.255814%.
Let’s only keep Population and Team and see what happens…
Call:
glm(formula = Medal.No.Yes ~ Population + Team, family = binomial(link = "logit"),
data = soft)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.75107 -0.00003 -0.00002 0.00003 0.80789
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.491e+01 3.768e+03 0.007 0.994726
Population -1.689e-07 5.047e-08 -3.346 0.000819 ***
TeamCanada -4.120e+01 5.327e+03 -0.008 0.993829
TeamChina 1.845e+02 3.769e+03 0.049 0.960953
TeamCuba -4.460e+01 8.436e+03 -0.005 0.995782
TeamItaly -3.680e+01 6.533e+03 -0.006 0.995506
TeamJapan -2.340e+00 3.768e+03 -0.001 0.999505
TeamNew Zealand -4.582e+01 8.436e+03 -0.005 0.995666
TeamUnited States 4.655e+01 5.049e+03 0.009 0.992644
TeamVenezuela -4.181e+01 8.436e+03 -0.005 0.996046
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 517.79 on 373 degrees of freedom
Residual deviance: 75.40 on 364 degrees of freedom
AIC: 95.4
Number of Fisher Scoring iterations: 20
fitting null model for pseudo-r2
McFadden
0.8543811
Predicted 0 Predicted 1 Total
Actual 0 180 15 195
Actual 1 0 179 179
Total 180 194 374
Area under the curve: 0.9811
The area under the curve is 0.9811, the true negative percentage was 92.3076923%, and the true positive percentage was 100%. Softball (and Baseball) were removed from the Olympics because the USA and Japan dominated the sport, and looking at these models it is clear that you can predict the chance of being awarded a medal based solely on where the team comes from. Of course, removing Softball from the Olympics had far-reaching negative impacts that are still felt by the sport years later… But looking at these logistic regression models, it makes sense why there was felt to be a need to remove Softball and Baseball from the Olympics.
For some sports, however, Age and Height do play a role, such as in Swimming, where Weight, GDP, and Population were not statistically significant:
Call:
glm(formula = Medal.No.Yes ~ Age + Weight + Height + GDP + Population +
Team, family = binomial(link = "logit"), data = swim)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8476 -0.4431 -0.2756 -0.0002 3.2767
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.337e+01 2.289e+03 -0.010 0.99185
Age 3.145e-02 9.750e-03 3.225 0.00126 **
Weight 4.687e-03 6.035e-03 0.777 0.43744
Height 2.121e-02 7.288e-03 2.911 0.00361 **
GDP -7.582e-15 1.597e-14 -0.475 0.63501
Population -9.145e-10 1.488e-09 -0.615 0.53873
TeamAndorra 1.976e-01 3.152e+03 0.000 0.99995
TeamArgentina 1.351e+01 2.289e+03 0.006 0.99529
TeamArmenia -1.016e-01 3.347e+03 0.000 0.99998
TeamAustralia 1.786e+01 2.289e+03 0.008 0.99377
TeamAustria 1.461e+01 2.289e+03 0.006 0.99491
TeamAzerbaijan 1.938e-01 3.147e+03 0.000 0.99995
TeamBahrain 3.710e-01 3.247e+03 0.000 0.99991
TeamBelarus 1.527e+01 2.289e+03 0.007 0.99468
TeamBelgium 1.448e+01 2.289e+03 0.006 0.99495
[ reached getOption("max.print") -- omitted 96 rows ]
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 10328 on 12766 degrees of freedom
Residual deviance: 6736 on 12656 degrees of freedom
(1327 observations deleted due to missingness)
AIC: 6958
Number of Fisher Scoring iterations: 17
fitting null model for pseudo-r2
McFadden
0.347799
Predicted 0 Predicted 1 Total
Actual 0 10605 378 10983
Actual 1 910 874 1784
Total 11515 1252 12767
Area under the curve: 0.8801
So for different sports, different factors go into modelling whether an athlete is awarded a medal or not. For some sports there is a "human element" not captured by the data, while for others this dataset has all the information needed to accurately predict whether an athlete will receive a medal.
7 Predicting Medal Winners Based on Athlete's Information and Sport
In this section, a classification model will be built using kNN and Random Forest to predict whether an athlete with certain characteristics can win a gold, silver, or bronze medal when they participate in the Olympics.
The first step is to drop columns that may not be useful. This is done based on domain knowledge as well as recognizing that some factors are collinearly related. For example, the Gross Domestic Product per capita (GDPpC) of a country's athlete has a direct relationship to GDP and Population. Another example is Weight, Height, and BMI: the former two are included in BMI's calculation.
NA values are all dropped. Linear interpolation of missing GDPpC values would assume GDP growth is linear in nature; however, as shown in the figure below, it is cyclical.
GDP growth (annual %) - Afghanistan, Indonesia, Jordan, Russian Federation, Mexico
Furthermore, estimating missing age, height, and weight for athletes would feed inaccurate data into the models.
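The preparation steps above (compute BMI, drop its collinear parents, and drop NAs rather than interpolate) can be sketched as follows. The report’s analysis is done in R; this pandas version with made-up values is illustrative only.

```python
import pandas as pd

# Hypothetical mini-frame with the kinds of columns used in the report.
df = pd.DataFrame({
    "Height": [170.0, 182.0, None],       # cm; one missing value
    "Weight": [60.0, 75.0, 70.0],         # kg
    "GDP": [1.0e9, 2.0e9, 3.0e9],
    "Population": [1.0e6, 2.0e6, 3.0e6],
    "GDPpC": [1000.0, 1000.0, 1000.0],
})

# BMI = Weight[kg] / (Height[m])^2, so Height and Weight are collinear with BMI.
df["BMI"] = df["Weight"] / (df["Height"] / 100) ** 2

# Keep the derived features, drop their collinear parents, then drop NAs
# instead of interpolating (GDP growth is cyclical, not linear).
df = df.drop(columns=["Height", "Weight", "GDP", "Population"]).dropna()
```

The row with the missing height is removed rather than imputed, matching the reasoning above.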
| NOC | Year | Decade | Sex | Age | BMI | BMI.Category | GDPpC | Season | Sport | Medal | Medal.No.Yes | Sex.Int | NOC.Int | Sport.Int |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AFG | 2008 | 2000s | M | 21 | 18.81215 | 2 | 364.6605 | Summer | Taekwondo | Bronze | 1 | 2 | 1 | 44 |
| AFG | 2012 | 2010s | M | 25 | 18.81215 | 2 | 641.8722 | Summer | Taekwondo | Bronze | 1 | 2 | 1 | 44 |
| ARG | 1964 | 1960s | M | 34 | 22.49135 | 2 | 1173.2382 | Summer | Equestrianism | Silver | 1 | 2 | 4 | 16 |
| ARG | 1968 | 1960s | M | 22 | 24.91077 | 2 | 1141.0806 | Summer | Boxing | Bronze | 1 | 2 | 4 | 10 |
| ARG | 1968 | 1960s | M | 24 | 25.99244 | 3 | 1141.0806 | Summer | Rowing | Bronze | 1 | 2 | 4 | 31 |
| ARG | 1972 | 1970s | M | 28 | 25.99244 | 3 | 1408.8652 | Summer | Rowing | Silver | 1 | 2 | 4 | 31 |
7.1 Random Forest
The data is split into training and test sets in a 70/30 ratio; the output below shows the realized split fraction followed by the training and test set sizes.
[1] 0.6999722
[1] 12582
[1] 5393
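A minimal sketch of the split, assuming the total row count implied by the printed set sizes above (12582 + 5393 = 17975); the report does this in R with a fixed seed.

```python
import random

random.seed(123)  # fixed seed so runs are reproducible, as in the report

rows = list(range(17975))          # 17975 = 12582 train + 5393 test rows
random.shuffle(rows)
cut = int(0.7 * len(rows))         # 70% boundary
train, test = rows[:cut], rows[cut:]
print(len(train) / len(rows))      # realized split fraction, ~0.70
```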
Below, the feature selection method in the randomForest library is used to pick the features that produce the highest accuracy, using cross-validation and backward feature selection.
Accuracy vs Number of Variables
The results above indicate the features that give the best accuracy results are GDPpC, Sex, Decade and Sport.
Call:
randomForest(formula = Medal ~ GDPpC + Sex + Decade + Sport, data = data_train1, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 32.61%
Confusion matrix:
Bronze Gold Silver class.error
Bronze 2838 734 647 0.3273288
Gold 619 3007 552 0.2802776
Silver 785 766 2634 0.3706093
Initial model without any tuning gives the results below:
Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
8.208552e-01 7.312837e-01 8.140410e-01 8.275207e-01 3.353203e-01
AccuracyPValue McnemarPValue
0.000000e+00 9.850543e-23
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
Class: Bronze 0.8343209 0.9128303 0.8284302 0.9161166 0.8284302
Class: Gold 0.8528004 0.8873156 0.7900222 0.9238107 0.7900222
Class: Silver 0.7753883 0.9311659 0.8488098 0.8926818 0.8488098
Recall F1 Prevalence Detection Rate
Class: Bronze 0.8343209 0.8313651 0.3353203 0.2797647
Class: Gold 0.8528004 0.8202118 0.3320617 0.2831823
Class: Silver 0.7753883 0.8104396 0.3326180 0.2579081
Detection Prevalence Balanced Accuracy
Class: Bronze 0.3377047 0.8735756
Class: Gold 0.3584486 0.8700580
Class: Silver 0.3038468 0.8532771
Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
6.690154e-01 5.031396e-01 6.562748e-01 6.815729e-01 3.517523e-01
AccuracyPValue McnemarPValue
0.000000e+00 2.131689e-05
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
Class: Bronze 0.6700053 0.8292334 0.6804069 0.8224113 0.6804069
Class: Gold 0.7197958 0.8228650 0.6637029 0.8580868 0.6637029
Class: Silver 0.6162724 0.8510929 0.6621203 0.8240741 0.6621203
Recall F1 Prevalence Detection Rate
Class: Bronze 0.6700053 0.6751660 0.3517523 0.2356759
Class: Gold 0.7197958 0.6906122 0.3269052 0.2353050
Class: Silver 0.6162724 0.6383742 0.3213425 0.1980345
Detection Prevalence Balanced Accuracy
Class: Bronze 0.3463749 0.7496193
Class: Gold 0.3545337 0.7713304
Class: Silver 0.2990914 0.7336826
Reference
Prediction Bronze Gold Silver
Bronze 1271 268 329
Gold 307 1269 336
Silver 319 226 1068
Model Parameter Tuning:
The default number of trees built in the randomForest model is ntree = 500. The code below increases the number of trees three times, in increments of 250, to see if this improves the confusion matrix metrics.
Random forest ntree = 500, 750, 1000 and 1250. Accuracy (red), Sensitivity (green), Specificity (blue) and Precision (black)
ntree = 1000 gave the best results for all metrics except sensitivity, which decreased by approximately 0.1%.
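The tree-count sweep can be sketched as below. The report uses R’s randomForest; this scikit-learn stand-in on synthetic data (where n_estimators plays the role of ntree) only illustrates the loop structure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class stand-in for the Bronze/Gold/Silver problem.
X, y = make_classification(n_samples=600, n_features=6, n_classes=3,
                           n_informative=4, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=123)

# Refit the forest at each candidate tree count and record test accuracy.
scores = {}
for ntree in (500, 750, 1000, 1250):
    rf = RandomForestClassifier(n_estimators=ntree, random_state=123)
    rf.fit(X_tr, y_tr)
    scores[ntree] = rf.score(X_te, y_te)
```

In practice the other metrics (sensitivity, specificity, precision) would be computed from each model’s confusion matrix in the same loop.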
Next, the maximum number of nodes is altered and models are compared at ntree = 1000. The maxnodes parameter was varied over 2500, 3000, 4000, and 5000.
Random forest maxnodes = 2500, 3000, 4000 and 5000. Accuracy (red), Sensitivity (green), Specificity (blue) and Precision (black)
Given the small accuracy changes due to the tuning attempts above, the final model is kept as done before with default parameters.
AUC = 0.75, an acceptable result.
Next, a kNN model will be built and compared with the final Random Forest model above.
7.2 KNN
In order for the kNN package to give the best results, the data is first scaled and all categorical columns are mapped to numeric codes. The same seed and split percentages are used so that the model can be compared with the Random Forest model.
[1] 0.6999722
[1] 12582
[1] 5393
Plotting the accuracy of kNN with default parameters and k varying from 1 to 31 gives the results below for the same features:
Accuracy vs k- kNN model
From the plot above k=5 gives the best accuracy.
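A sketch of the scale-then-sweep-k procedure, again as a scikit-learn stand-in on synthetic data (the report does this in R):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class stand-in for the Bronze/Gold/Silver problem.
X, y = make_classification(n_samples=600, n_features=6, n_classes=3,
                           n_informative=4, random_state=123)

# kNN is distance-based, so features are scaled first (categorical columns
# would be mapped to numeric codes before this step).
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=123)

# Test accuracy for each odd k from 1 to 31; pick the best.
acc = {k: KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te)
       for k in range(1, 32, 2)}
best_k = max(acc, key=acc.get)
```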
Validation using training data results are shown below:
olympics_5NN1
Bronze Gold Silver
4709 4369 3504
Validation using test data results are shown below:
olympics_5NN2
Bronze Gold Silver
1995 1897 1501
Confusion matrix for both training and test validations are shown below:
Reference
Prediction Bronze Gold Silver
Bronze 3393 595 721
Gold 491 3225 653
Silver 335 358 2811
Reference
Prediction Bronze Gold Silver
Bronze 1284 329 382
Gold 353 1219 325
Silver 260 215 1026
Below is the summary of Accuracy, Sensitivity, Specificity and Precision for the training and test sets at k = 5.
| Model | Accuracy | Sensitivity | Specificity | Precision |
|---|---|---|---|---|
| Train | 0.7494039 | 0.804219 | 0.7719004 | 0.6716846 |
| Test | 0.6543668 | 0.6768582 | 0.6914351 | 0.5920369 |
And the results for Random Forest:
| Model | Accuracy | Sensitivity | Specificity | Precision |
|---|---|---|---|---|
| Train | 0.8208552 | 0.8343209 | 0.8528004 | 0.7753883 |
| Test | 0.6690154 | 0.6700053 | 0.7197958 | 0.6162724 |
The results show that the Random Forest model gives better results than kNN.
8 Pandemic (Spanish Flu)
With the novel coronavirus pandemic and its delay of the Tokyo Olympics this year, we thought it would be interesting to study a pandemic from the last century and analyze the impact it had on Olympic performance. The following European countries suffered 2.64 million excess deaths during the period when the H1N1 pandemic (commonly called the Spanish Flu) was circulating, from January 1918 to June 1919: Italy, Bulgaria, Portugal, Spain, Netherlands, Sweden, Germany, Switzerland, France, Norway, Denmark, and the UK (Scotland, England, Wales). In the US, 675,000 people died from H1N1, which was 0.8 percent of the 1910 population.
(Johnson, Niall P. A. S., and Juergen Mueller. “Updating the Accounts: Global Mortality of the 1918-1920 ‘Spanish’ Influenza Pandemic.” Bulletin of the History of Medicine, vol. 76, no. 1, 2002, pp. 105–115. JSTOR, www.jstor.org/stable/44446153. Accessed 19 Apr. 2020.) Taken from [1].
Of the European countries that suffered significant excess deaths during the Spanish Influenza Pandemic, these countries competed before and after 1918-1919: Denmark (DEN), France (FRA), Great Britain (GBR), Italy (ITA), Netherlands (NED), Norway (NOR), Sweden (SWE), and United States (USA). We created a separate pandemic data set containing athletes from these countries that competed in the Olympics between 1908-1928 to study before and after the pandemic.
8.1 Medals Earned
The plot shows the number of medals earned by Denmark (DEN), France (FRA), Great Britain (GBR), Italy (ITA), Netherlands (NED), Norway (NOR), Sweden (SWE), and the United States (USA) before and after the pandemic. More than one gold, silver, or bronze medal may be counted per event because every member of a winning team receives one. Great Britain (GBR), Denmark (DEN), and Sweden (SWE) saw a decline in the number of medals their athletes earned after the pandemic. The Olympics were not held in 1916 due to World War I.
8.2 Number of Olympic Athletes
Looking at the same countries, the plot shows the number of athletes they sent to the Olympics from 1908 to 1928. Great Britain and Sweden saw a sharp decline in the number of athletes they sent to Olympic events after the pandemic. Johnson and Mueller report a death toll of approximately 200,000 for England & Wales and 34,374 for Sweden during the 1918-1919 pandemic.
8.3 Average Age of Olympians
The chart displays the average age of Olympians from the eight countries in our data before and after the pandemic.
The H1N1 influenza pandemic (“Spanish flu”) was especially fatal for individuals aged 20–40 years. The average age of Olympians competing after the pandemic increased for all countries in our data set.
8.4 Average Height and Weight of Olympians
Netherlands (NED) and Norway (NOR) saw a significant increase in the average height and weight of their athletes after the Spanish flu pandemic, while Sweden (SWE) and France (FRA) saw a decrease in those averages.
8.5 Total Number of Olympic Medals (Summer Events) vs. Year
8.6 Creating Time Series - Italy
During the 1918 pandemic, Italy’s death toll was approximately 390,000. According to Worldometer, there have been 29,079 deaths in Italy during the current Covid-19 pandemic (as of this writing). We will focus on Italy to conduct time series analysis. Can we find a pattern in historical Olympic data from before and after the Spanish flu in order to predict how Italy will fare in future Olympics after it emerges from the current novel coronavirus pandemic?
The Olympics data for Italy from 1908-1928 was converted to a time series and plotted “Total Number of Medals vs Year”.
The time series shows random fluctuations in the data over time and no overall trend. The autocorrelation function (ACF) plot does not show seasonality, periodicity, or cyclic behavior, and the ACF values are small (below the significance threshold), so they are not significant. The number of medals may therefore not be correlated over time.
8.7 Using 1908-1928 Time Series as Training Data
From this quick EDA, a time series may not be an appropriate method to model the number of medals Italy may earn in future Olympics. As an academic exercise, we will nonetheless explore different time series methodologies for forecasting.
8.8 Exploring Holt-Winters and ETS-ANN Time Series Forecasting
Holt-Winters uses exponential smoothing to make short-term forecasts. A model is designated as either additive or multiplicative.
Length Class Mode
fitted 10 mts numeric
x 6 ts numeric
alpha 1 -none- numeric
beta 1 -none- logical
gamma 1 -none- logical
coefficients 1 -none- numeric
seasonal 1 -none- character
SSE 1 -none- numeric
call 4 -none- call
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
2012 129.7571 80.88325 178.6310 55.01098 204.5032
2016 129.7571 67.63072 191.8835 34.74299 224.7712
2020 129.7571 56.74531 202.7689 18.09519 241.4190
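In the fitted model above, beta and gamma are disabled (shown as logical in the summary), so Holt-Winters reduces to simple exponential smoothing with a flat forecast, which is why the point forecast is identical for 2012, 2016, and 2020. A minimal sketch with hypothetical medal counts and an arbitrary alpha:

```python
def ses(y, alpha):
    """Simple exponential smoothing: level = alpha*y_t + (1-alpha)*level.

    With no trend (beta) or seasonal (gamma) component, the point forecast
    for every future horizon is the final smoothed level, i.e. it is flat.
    """
    level = y[0]
    for obs in y[1:]:
        level = alpha * obs + (1 - alpha) * level
    return level

medals = [40, 65, 107, 97, 135, 78]  # hypothetical short medal series
print(ses(medals, alpha=0.8))        # flat forecast for 2012, 2016, 2020
```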
8.9 Time Series Linear Model for Italy
The plots above try to fit the number of medals with an algorithm that would best predict the number of Olympic medals Italy will earn after the Covid-19 pandemic. Visually, the cubic spline seems the closest approximation.
8.10 Arima Model
Series: olympic_all
ARIMA(0,0,0) with non-zero mean
Coefficients:
mean
107.6000
s.e. 11.1972
sigma^2 estimated as 2015: log likelihood=-77.83
AIC=159.66 AICc=160.66 BIC=161.07
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set 1.98952e-14 43.36635 37.70667 -18.08371 41.84652 0.3504337
ACF1
Training set 0.2284556
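An ARIMA(0,0,0) model with a non-zero mean, as fitted above, has no autoregressive, moving-average, or differencing terms, so every forecast is simply the mean of the training series. A sketch with hypothetical numbers:

```python
# ARIMA(0,0,0) with non-zero mean: the flat forecast equals the sample mean.
series = [40, 65, 107, 97, 135, 78]   # hypothetical medal counts
forecast = sum(series) / len(series)  # same value at every future horizon
```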
The Olympic data set was divided into a training set (1908-1928), which includes the 1918-1919 Spanish influenza pandemic, and a test set (2008-2016). MAPE (Mean Absolute Percent Error), which measures the size of the error in percentage terms (taken from [2]), is used to evaluate the Holt-Winters and ETS-ANN predictions of the medals won by Italy in 2012 and 2016. The table below shows MAPE values of less than 10%, indicating a reasonable model.
| Year | Actual | Predicted | MAPE (%) |
|---|---|---|---|
| 2012 | 132 | 130 | 1.53 |
| 2016 | 144 | 130 | 9.72 |
| 2020 | NA | 130 | NA |
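For a single year, MAPE is the absolute error as a percentage of the actual value; a minimal sketch, e.g. with actual = 144 and predicted = 130 medals:

```python
def mape(actual, predicted):
    """Absolute percent error for one observation, in %."""
    return abs(actual - predicted) / abs(actual) * 100.0

print(round(mape(144, 130), 2))  # -> 9.72
```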
The ARIMA time series model predicts 107 medals for Italy in 2020. We will have to wait until the next Olympics to evaluate the accuracy of these predictions for Italy’s performance.
The Olympic medals Italy earned before and after the 1918-1919 Spanish flu pandemic, together with more recent data, were analyzed as a time series at a superficial level. The resulting Olympic time series data on Italy could not be easily decomposed into the main components of a time series: trend, season, or irregular fluctuations. There is no hidden information to be deciphered from the 1918-1919 pandemic that could inform the future after Covid-19.
9 Trends over time
9.1 Creating the models
I’m going to try 4 different models:
\[ y_{\text{linear}}(x) = ax+b \\ y_{\text{exponential}}(x) = a\exp(bx) + c \\ y_{\text{quadratic}}(x) = ax^2 + bx + c \\ y_{\text{cubic}}(x) = ax^3 + bx^2 + cx + d \]
The first model is our traditional linear regression, whereas the other 3 models are nonlinear: exponential, quadratic polynomial, and cubic polynomial. I will use ANOVA to compare the nested models, and I will also use the reduced chi-square “rule of thumb” to test my fits. The F-test is sensitive to over-fitting in a way that the chi-square statistic is not.
9.2 Chi-square “rule of thumb”
We consider \(m\) equations that relate the \(n\) random variables with values
\[ y_j = f(a_1,...,a_m,x_j) + \epsilon_j, ~~~~~ j = 1,...,n \]
If we assume that only the \(n-m\) random variables can fluctuate independently, and that the data uncertainties follow a Normal distribution, then the resulting chi-square is expected to be distributed according to
\[ \chi^2_{\nu}, ~~~~~ \nu = n - m \]
Which we can compare to the “rule of thumb”, which is
\[ \text{if } \frac{\chi^2}{n-m} \approx 1 ~~~ \text{where }\nu = n - m \implies \text{ "a good fit"} \]
If we have \(\nu\) independent RVs and the \(x_i\) are each normally distributed with mean \(\mu_i\) and variance \(\sigma_i^2\), then the chi-square is,
\[ \frac{\chi^2}{n-m} = \frac{\sum_{i=1}^{\nu} (x_i-\mu_i)^2/\sigma_i^2}{n-m} \]
where \(n\) is the length of our y data and \(m\) is the number of fitted parameters. Using the above equation, we can calculate the values for our \(\chi^2\) “rule of thumb”.
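A direct translation of this formula (the report computes it in R; the data values here are illustrative):

```python
def reduced_chi_square(y, mu, sigma, n_params):
    """chi^2 / (n - m): sum of squared normalized residuals divided by the
    degrees of freedom. Values near 1 indicate a good fit per the rule of
    thumb above."""
    chi2 = sum((yi - mi) ** 2 / si ** 2 for yi, mi, si in zip(y, mu, sigma))
    return chi2 / (len(y) - n_params)

# A model matching the data to within roughly one sigma gives a value ~1.
y     = [1.0, 2.1, 2.9, 4.2]   # observed values
model = [1.0, 2.0, 3.0, 4.0]   # model predictions mu_i
sigma = [0.2, 0.2, 0.2, 0.2]   # per-point uncertainties
print(reduced_chi_square(y, model, sigma, n_params=2))
```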
9.3 Number of Events
First let’s look to see if we can model the number of events per Olympic Games over the years. This will require us to import some more data from https://www.topendsports.com/events/summer/sports/number.htm [3].
There is clearly an upward trend, but no seasonal pattern. The data is also a little choppy at the beginning. Part of the explanation is that the data points are not evenly spaced. Most Olympic games are 4 years apart, but a few of them are just 2 years apart, and during World War I and World War II there were 8-year and 12-year gaps, respectively. Since time series data should be evenly spaced over time, we’ll only look at data from 1948 on, when the Olympics started being held every 4 years without any interruptions.
Now I will try the model fits on the number of events per Olympic Games data.
Now that we have our models, we can run the ANOVA and chi-square tests.
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 17 | 33.33158 | NA | NA | NA | NA |
| Exponential | 16 | 32.28073 | 1 | 1.050845 | 0.520853 | 0.4808928 |
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 17 | 33.33158 | NA | NA | NA | NA |
| Quadratic | 16 | 30.41244 | 1 | 2.919136 | 1.5357588 | 0.2331213 |
| Cubic | 15 | 29.04971 | 1 | 1.362737 | 0.7036577 | 0.4147260 |
| Model | Chi^2/d.o.f |
|---|---|
| Exponential | 0.076 |
| Linear | 0.084 |
| Quadratic | 0.081 |
| Cubic | 0.083 |
According to the ANOVA test the best fit is the linear one, and adding complexity to the model does not help very much. F-tests are more sensitive to over-fitting than the chi-square test, so even though the chi-square/dof statistic shows a preference for the more complex models, we anticipated that result.
9.3.1 Monte Carlo Simulations for Number of Events
Now I am going to run the MC simulations on this data. I will introduce pseudorandom noise into the y data using N(0, \(\sigma\)), where the standard deviation has been calculated from the original y data itself.
Now that we’ve generated our data, let’s fit it and see what values we get for the parameters \(a\) and \(b\) (these are the parameters from our linear model fit \(y = ax+b\)).
[[1]]
Nonlinear regression model
model: ydata_long[[k]] ~ lin_func(xdata, a, b)
data: parent.frame()
a b
0.2285 -429.6933
residual sum-of-squares: 363.6
Number of iterations to convergence: 2
Achieved convergence tolerance: 1.49e-08
[[2]]
Nonlinear regression model
model: ydata_long[[k]] ~ lin_func(xdata, a, b)
data: parent.frame()
a b
0.1642 -303.2673
residual sum-of-squares: 286.1
Number of iterations to convergence: 2
Achieved convergence tolerance: 1.49e-08
I just showed 2 of the fitted models from the Monte Carlo simulations above. I’m going to make a data frame from the parameter values and the chi-square and reduced chi-square values, for ease of future use.
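The jitter-and-refit loop can be sketched as follows; the report uses R’s nls, and the trend line and noise level here are hypothetical stand-ins:

```python
import random

def lin_fit(xs, ys):
    """Ordinary least-squares fit of y = a*x + b (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx

random.seed(123)
xdata = list(range(1948, 2017, 4))            # Games years, every 4 years
ydata = [0.2 * x - 380 for x in xdata]        # hypothetical "true" trend
sigma = 5.0                                   # sd estimated from the y data

# Jitter y with N(0, sigma) noise and refit, collecting (a, b) each time.
params = [lin_fit(xdata, [y + random.gauss(0, sigma) for y in ydata])
          for _ in range(1000)]
a_mean = sum(a for a, b in params) / len(params)
```

The cloud of (a, b) pairs is what gets plotted and summarized with confidence ellipses below.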
Now I can plot the values of the parameters.
$cov
[,1] [,2]
[1,] 0.002776332 -5.505294
[2,] -5.505294040 10917.803748
$center
[1] 0.2078603 -389.5696527
$n.obs
[1] 1000
eigen() decomposition
$values
[1] 1.091781e+04 2.921320e-07
$vectors
[,1] [,2]
[1,] -0.0005042492 -0.9999998729
[2,] 0.9999998729 -0.0005042492
[1] 2.561449e+02 1.324975e-03
Here we can see the scatterplots made from the possible parameter values from the Monte Carlo simulations. What we are doing here is jittering the possible y-values and checking whether the parameter values converge on a solution. That solution is the center of the ellipse shown above, where we have included the 68% confidence interval. For linear regression the parameter values found by this method should agree with the ones found from the model fit itself. This method of using MC sims comes more into play for our nonlinear model fits, which are preferred in certain scenarios.
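The ellipse axes come from the eigen-decomposition of the parameter covariance matrix, as in the $cov/eigen() output above. A sketch using the rounded covariance values, solving the symmetric 2x2 case in closed form:

```python
import math

# Rounded 2x2 covariance of the (a, b) parameter cloud from the MC output.
cov = [[0.00278, -5.505],
       [-5.505, 10917.8]]

# For a symmetric 2x2 matrix the eigenvalues follow from trace/determinant;
# eigenvectors give the ellipse axis directions and eigenvalues the axis
# scales (variances), which a confidence-level factor then rescales.
tr = cov[0][0] + cov[1][1]
det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
disc = math.sqrt(tr * tr / 4.0 - det)
lam1, lam2 = tr / 2.0 + disc, tr / 2.0 - disc          # lam1 >= lam2
angle = math.atan2(lam1 - cov[0][0], cov[0][1])        # major-axis direction
```

The huge spread between the eigenvalues mirrors the output above: almost all of the variance lies along the intercept direction.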
9.4 Sports
I’m not going to look at all the sports, only a few of them, in order to create my models. I’m going to choose the “top” sports by the number of athletes participating in them.
9.4.1 Swimming
9.4.1.1 Female Swimmers
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 7.903133 | NA | NA | NA | NA |
| Exponential | 12 | 7.579076 | 1 | 0.3240577 | 0.5130828 | 0.4875138 |
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 7.9031333 | NA | NA | NA | NA |
| Quadratic | 12 | 3.8146391 | 1 | 4.088494 | 12.86149 | 0.0037388 |
| Cubic | 11 | 0.6256824 | 1 | 3.188957 | 56.06442 | 0.0000122 |
| Model | Chi^2/d.o.f |
|---|---|
| Exponential | 1.75 |
| Linear | 1.855 |
| Quadratic | 1.861 |
| Cubic | 2.005 |
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 12.41612 | NA | NA | NA | NA |
| Exponential | 12 | 12.06339 | 1 | 0.352724 | 0.3508705 | 0.5646156 |
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 12.416116 | NA | NA | NA | NA |
| Quadratic | 12 | 6.592881 | 1 | 5.823234 | 10.59913 | 0.0068842 |
| Cubic | 11 | 2.695634 | 1 | 3.897247 | 15.90339 | 0.0021298 |
| Model | Chi^2/d.o.f |
|---|---|
| Exponential | 64.307 |
| Linear | 68.114 |
| Quadratic | 72.674 |
| Cubic | 77.57 |
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 4.783949 | NA | NA | NA | NA |
| Exponential | 12 | 4.837242 | 1 | -0.0532928 | -0.1322063 | 1 |
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 4.783949 | NA | NA | NA | NA |
| Quadratic | 12 | 4.726270 | 1 | 0.0576792 | 0.1464475 | 0.7086457 |
| Cubic | 11 | 2.849233 | 1 | 1.8770373 | 7.2466564 | 0.0209556 |
| Model | Chi^2/d.o.f |
|---|---|
| Exponential | 972.298 |
| Linear | 1029.463 |
| Quadratic | 1093.665 |
| Cubic | 1166.649 |
9.4.1.1.1 MC simulations
We saw that for the mean ages of female swimmers the cubic nonlinear model was much preferred. I’m going to perform MC simulations for that model.
I will do 500 simulations.
a.values b.values c.values d.values
1 -0.0001212230 0.7250847 -1445.503 960471.7
2 -0.0001151618 0.6887442 -1372.882 912102.2
3 -0.0001176500 0.7036960 -1402.823 932083.3
4 -0.0001010282 0.6045280 -1205.615 801364.9
5 -0.0001152262 0.6889873 -1373.081 912043.4
6 -0.0001258358 0.7523375 -1499.162 995681.4
Here we can see the histograms of the parameters, the correlations of the parameters with each other, and the scatterplots. We use the centers of those scatterplots/ellipses to get the values of our parameters \(a,b,c,d\) from the cubic nonlinear model \(y=ax^3 + bx^2 + cx + d\).
9.4.1.2 Male Swimmers
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 3.748118 | NA | NA | NA | NA |
| Exponential | 12 | 3.600919 | 1 | 0.1471987 | 0.4905372 | 0.4970431 |
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 3.748118 | NA | NA | NA | NA |
| Quadratic | 12 | 2.241577 | 1 | 1.506541 | 8.065079 | 0.0148993 |
| Cubic | 11 | 1.009733 | 1 | 1.231844 | 13.419671 | 0.0037331 |
| Model | Chi^2/d.o.f |
|---|---|
| Exponential | 1.089 |
| Linear | 1.154 |
| Quadratic | 1.171 |
| Cubic | 1.26 |
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 10.70453 | NA | NA | NA | NA |
| Exponential | 12 | 10.70814 | 1 | -0.0036119 | -0.0040476 | 1 |
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 10.704526 | NA | NA | NA | NA |
| Quadratic | 12 | 10.701317 | 1 | 0.003209 | 0.0035984 | 0.9531535 |
| Cubic | 11 | 2.489111 | 1 | 8.212206 | 36.2917839 | 0.0000862 |
| Model | Chi^2/d.o.f |
|---|---|
| Exponential | 132.438 |
| Linear | 140.215 |
| Quadratic | 148.989 |
| Cubic | 159.004 |
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 10.38947 | NA | NA | NA | NA |
| Exponential | 12 | 10.92533 | 1 | -0.5358625 | -0.5885724 | 1 |
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 10.389471 | NA | NA | NA | NA |
| Quadratic | 12 | 3.939198 | 1 | 6.4502728 | 19.649498 | 0.0008171 |
| Cubic | 11 | 3.101914 | 1 | 0.8372844 | 2.969176 | 0.1128210 |
| Model | Chi^2/d.o.f |
|---|---|
| Exponential | 1155.739 |
| Linear | 1223.715 |
| Quadratic | 1298.614 |
| Cubic | 1385.236 |
We can see that the chi-square/d.o.f. blows up for both female and male swimmers when considering their weights and heights. This is because the calculation of the chi-square statistic uses the standard deviation, and the errors need to be both independent and normally distributed. This holds for the ages but not for the weights and heights of the athletes, so the chi-square statistic is unreliable there. Instead we must rely on the ANOVA/F-test.
9.4.2 Athletics
9.4.2.1 Female Athletes
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 5.529516 | NA | NA | NA | NA |
| Exponential | 12 | 5.612178 | 1 | -0.0826618 | -0.1767481 | 1 |
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 5.529516 | NA | NA | NA | NA |
| Quadratic | 12 | 5.228328 | 1 | 0.3011881 | 0.6912834 | 0.4219646 |
| Cubic | 11 | 1.452994 | 1 | 3.7753346 | 28.5814615 | 0.0002352 |
| Model | Chi^2/d.o.f |
|---|---|
| Exponential | 0.972 |
| Linear | 1.032 |
| Quadratic | 1.119 |
| Cubic | 1.213 |
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 28.11069 | NA | NA | NA | NA |
| Exponential | 12 | 28.05736 | 1 | 0.0533365 | 0.0228118 | 0.8824568 |
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 28.11069 | NA | NA | NA | NA |
| Quadratic | 12 | 22.10677 | 1 | 6.003922 | 3.259050 | 0.0961630 |
| Cubic | 11 | 16.03856 | 1 | 6.068212 | 4.161866 | 0.0660967 |
| Model | Chi^2/d.o.f |
|---|---|
| Exponential | 61.298 |
| Linear | 64.927 |
| Quadratic | 69.3 |
| Cubic | 73.995 |
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 9.182104 | NA | NA | NA | NA |
| Exponential | 12 | 9.179206 | 1 | 0.0028981 | 0.0037887 | 0.9519325 |
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 9.182104 | NA | NA | NA | NA |
| Quadratic | 12 | 9.166367 | 1 | 0.0157370 | 0.0206019 | 0.888251 |
| Cubic | 11 | 8.689804 | 1 | 0.4765626 | 0.6032574 | 0.453716 |
| Model | Chi^2/d.o.f |
|---|---|
| Exponential | 961.677 |
| Linear | 1018.249 |
| Quadratic | 1081.963 |
| Cubic | 1154.131 |
9.4.2.2 Male Athletes
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 0.8160019 | NA | NA | NA | NA |
| Exponential | 12 | 0.8030124 | 1 | 0.0129895 | 0.1941123 | 0.6673479 |
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 0.8160019 | NA | NA | NA | NA |
| Quadratic | 12 | 0.6826503 | 1 | 0.1333517 | 2.344129 | 0.1516830 |
| Cubic | 11 | 0.5394194 | 1 | 0.1432308 | 2.920805 | 0.1154692 |
| Model | Chi^2/d.o.f |
|---|---|
| Exponential | 1.253 |
| Linear | 1.328 |
| Quadratic | 1.402 |
| Cubic | 1.498 |
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 17.50076 | NA | NA | NA | NA |
| Exponential | 12 | 17.47744 | 1 | 0.0233258 | 0.0160155 | 0.9013904 |
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 17.50076 | NA | NA | NA | NA |
| Quadratic | 12 | 17.44997 | 1 | 0.0507891 | 0.0349266 | 0.8548719 |
| Cubic | 11 | 10.31043 | 1 | 7.1395474 | 7.6170489 | 0.0185592 |
| Model | Chi^2/d.o.f |
|---|---|
| Exponential | 116.888 |
| Linear | 123.758 |
| Quadratic | 131.532 |
| Cubic | 140.38 |
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 2.745302 | NA | NA | NA | NA |
| Exponential | 12 | 2.787119 | 1 | -0.0418169 | -0.1800435 | 1 |
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 2.745302 | NA | NA | NA | NA |
| Quadratic | 12 | 2.357623 | 1 | 0.3876792 | 1.973237 | 0.1854577 |
| Cubic | 11 | 1.721252 | 1 | 0.6363706 | 4.066851 | 0.0688116 |
| Model | Chi^2/d.o.f |
|---|---|
| Exponential | 1111.757 |
| Linear | 1177.154 |
| Quadratic | 1250.336 |
| Cubic | 1333.737 |
9.4.3 Gymnastics
9.4.3.1 Female Gymnasts
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 27.93686 | NA | NA | NA | NA |
| Exponential | 12 | 27.74453 | 1 | 0.1923255 | 0.0831842 | 0.7779491 |
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 27.936859 | NA | NA | NA | NA |
| Quadratic | 12 | 2.260982 | 1 | 25.6758762 | 136.272846 | 0.0000001 |
| Cubic | 11 | 2.022192 | 1 | 0.2387902 | 1.298933 | 0.2786178 |
| Model | Chi^2/d.o.f |
|---|---|
| Exponential | 1.935 |
| Linear | 2.04 |
| Quadratic | 2.004 |
| Cubic | 2.144 |
I was unable to fit the weight and height data due to its highly irregular shape, and it has thus not been included.
9.4.3.1.1 MC Sims
I am going to fit the Female Gymnasts age data with a quadratic model, because it is preferred by the ANOVA test.
a.values b.values c.values
1 0.005073795 -20.20323 20128.49
2 0.004882513 -19.44384 19374.81
3 0.004994383 -19.88447 19808.69
4 0.004958735 -19.74279 19667.91
5 0.004741524 -18.87938 18809.97
6 0.004815924 -19.17375 19101.08
From the simulations we see what values the parameters trend towards: the centers of the ellipses. We have to check what the parameter values \(a,b,c\) from the quadratic nonlinear model \(y=ax^2 + bx + c\) will be because this is not linear regression.
9.4.3.2 Male Gymnasts
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 15.68979 | NA | NA | NA | NA |
| Exponential | 12 | 15.59344 | 1 | 0.0963473 | 0.0741445 | 0.7900239 |
| Models | Res.Df | Res.Sum Sq | Df | Sum Sq | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| Linear | 13 | 15.689788 | NA | NA | NA | NA |
| Quadratic | 12 | 3.505150 | 1 | 12.184637 | 41.714517 | 0.0000312 |
| Cubic | 11 | 1.989745 | 1 | 1.515406 | 8.377689 | 0.0145899 |
| Model | Chi^2/d.o.f |
|---|---|
| Exponential | 1.081 |
| Linear | 1.146 |
| Quadratic | 1.16 |
| Cubic | 1.257 |
9.5 Results from models over time
We can see from our models that you might not always want to use linear regression; a nonlinear model might be better suited. We can check whether a nonlinear model is preferred by using ANOVA or the chi-square test, but sometimes there are issues with these tests (such as the errors of the data not being independent or normally distributed).
For these nonlinear models we need to perform something like Monte Carlo simulations, where we “jitter” the y-data and perform many simulations to see if the parameter values of the model will change as we have more and more data (frequentist approach). From these MC simulations then we can see what values the parameters converge upon by looking at their correlations to one another, the histograms of their values, and the scatterplots of two of the parameters and the resulting confidence ellipses. The centers of these ellipses will be the value of the parameter that should ultimately be used in our model.
10 Summary and Conclusions
10.1 KMeans & KMedoids
- On continuous categories: Age, Weight, Height, Population, GDP
- Triathlon: clusters are statistically significant (H >= 0.77), and Population+GDP appear to add another cluster
- Softball: clusters are statistically significant (H >= 0.77); however, Population+GDP do not appear to add another cluster in this case; the clusters, if present, are interwoven/on top of one another
10.2 Logistic Regression
- Difficult to model Medal: Yes/No for the entire data set
- Modeling Medal: Yes/No is more successful when subsetting the data by Sport
- Some sports like Softball and Baseball can be modelled very accurately with only Population and Team, whereas Swimming requires Age, Height, and Team
10.3 KNN vs Random Forest
- Better to model Bronze/Silver/Gold than Medal Yes/No
- Random Forest had higher accuracy; the 4-feature model (GDPpC, Sex, Decade, Sport) gave the highest accuracy
10.4 Pandemic
- Time series analysis using Holt-Winters and ARIMA
- Need to employ evaluation techniques on models
- Periodic behavior observed
- Studying the 1918-1919 Spanish influenza pandemic did not reveal any major conclusions applicable to the current Covid-19 pandemic; World War I also coincided with the Spanish flu
10.5 Trends Over Time
- 4 models: linear, exponential, quadratic, cubic
- Tested w/ ANOVA & chi-square/dof statistics; results change w/ sport, sex, age, weight, and height, so a variety of models are preferred, not just linear
- Check parameter values using MC simulations
References
1 N.P.A.S. Johnson and J. Mueller, Bulletin of the History of Medicine 76, 105 (2002).
2 E. Stellwagen, Forecasting 101: A Guide to Forecast Error Measurement Statistics and How to Use Them (n.d.).
3 R. Wood, Topend Sports (n.d.).